In this work, we investigate improving the generalizability of GAN-generated image detectors by performing data augmentation in the fingerprint domain. Specifically, we first separate the fingerprints and contents of GAN-generated images using an autoencoder-based GAN fingerprint extractor, and then randomly perturb the fingerprints. The perturbed fingerprints are substituted for the original ones and added back to the original contents, producing images that are visually unchanged but carry distinct fingerprints. The perturbed images successfully imitate images generated by different GANs and thereby improve the generalization of the detectors, as demonstrated by spectrum visualizations. To our knowledge, we are the first to conduct data augmentation in the fingerprint domain. Our work explores a direction distinct from previous works on spatial- and frequency-domain augmentation. Extensive cross-GAN experiments demonstrate the effectiveness of our method compared with state-of-the-art methods in detecting fake images generated by unknown GANs.
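The separate, perturb, and recombine steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the autoencoder-based extractor is replaced by a hypothetical stand-in (`moving_average`, a simple smoother on a 1-D signal), and the multiplicative noise in `perturb` is an assumed form of "random perturbation", which the abstract does not specify.

```python
import random


def moving_average(image):
    """Hypothetical stand-in for the content branch of a fingerprint extractor."""
    n = len(image)
    return [
        (image[max(i - 1, 0)] + image[i] + image[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def split_fingerprint(image, content_fn):
    """Split an image into content and the residual treated as its fingerprint."""
    content = content_fn(image)
    fingerprint = [p - c for p, c in zip(image, content)]
    return content, fingerprint


def perturb(fingerprint, scale, rng):
    """Assumed perturbation: rescale each fingerprint value by a random factor."""
    return [f * (1.0 + rng.uniform(-scale, scale)) for f in fingerprint]


def augment(image, content_fn=moving_average, scale=0.5, seed=0):
    """Return an image with the same content but a perturbed fingerprint."""
    rng = random.Random(seed)
    content, fingerprint = split_fingerprint(image, content_fn)
    new_fingerprint = perturb(fingerprint, scale, rng)
    return [c + f for c, f in zip(content, new_fingerprint)]
```

Because only the residual is modified, each pixel of the augmented image differs from the original by at most `scale` times the magnitude of its fingerprint value.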
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in a high-dimensional data space results in slow inference, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model that learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and on two more challenging datasets, one with multiple speakers (LibriTTS) and one with a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed, measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
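The change of generation target described above can be sketched as follows. This is a minimal illustration, not the ResGrad implementation: spectrograms are plain nested lists, and `predict_residual` is a hypothetical callable standing in for the diffusion model, which is not reproduced here.

```python
def residual_target(ground_truth, tts_output):
    """Training target: the ground-truth spectrogram minus the TTS model output."""
    return [
        [g - o for g, o in zip(g_row, o_row)]
        for g_row, o_row in zip(ground_truth, tts_output)
    ]


def refine(tts_output, predict_residual):
    """Inference: add the predicted residual to the unchanged TTS model output."""
    residual = predict_residual(tts_output)
    return [
        [o + r for o, r in zip(o_row, r_row)]
        for o_row, r_row in zip(tts_output, residual)
    ]
```

The existing TTS model is only read from, never updated, which is what makes the refinement step plug-and-play in this sketch.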
Graphical user interfaces (GUIs) are in great demand as mobile apps grow in popularity. Automatic UI code generation from UI design drafts dramatically simplifies the development process. However, the nested layer structure in a design draft affects the quality and usability of the generated code. Few existing automated GUI techniques detect and group nested layers to improve the accessibility of the generated code. In this paper, we propose the UI Layers Group Detector, a vision-based method that automatically detects image layers (i.e., basic shapes and visual elements) and text layers that share the same semantic meaning. We propose two plug-in components, text fusion and box attention, that use text information from design drafts as prior information for group localization. We construct a large-scale UI dataset for training and testing, and present a data augmentation approach to boost detection performance. Experiments show that the proposed method achieves decent accuracy in layer grouping.
Error correction in automatic speech recognition (ASR) aims to correct the incorrect words in sentences generated by ASR models. Since recent ASR models usually have a low word error rate (WER), error correction models should modify only incorrect words to avoid affecting tokens that are already correct, so detecting incorrect words is important for error correction. Previous works on error correction either implicitly detect error words through target-source attention or CTC (connectionist temporal classification) loss, or explicitly locate specific deletion/substitution/insertion errors. However, implicit error detection does not provide a clear signal about which tokens are incorrect, and explicit error detection suffers from low detection accuracy. In this paper, we propose SoftCorrect, which uses a soft error detection mechanism to avoid the limitations of both explicit and implicit error detection. Specifically, we first detect whether a token is correct through a probability produced by a specially designed language model, and then design a constrained CTC loss that duplicates only the detected incorrect tokens so that the decoder focuses on correcting error tokens. Compared with implicit error detection with CTC loss, SoftCorrect provides an explicit signal about which words are incorrect and thus needs to duplicate only the incorrect tokens rather than every token; compared with explicit error detection, SoftCorrect does not detect specific deletion/substitution/insertion errors but leaves that to the CTC loss. Experiments on the AISHELL-1 and Aidatatang datasets show that SoftCorrect achieves 26.1% and 9.4% CER reduction respectively, outperforming previous works by a large margin while retaining the fast speed of parallel generation.
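The detect-then-duplicate idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SoftCorrect implementation: the per-token probabilities are taken as given (the language model is not reproduced), and both the `threshold` of 0.5 and the `copies` count of 3 are hypothetical values chosen only for the example.

```python
def detect_errors(token_probs, threshold=0.5):
    """Flag a token as likely incorrect when its probability is below the threshold."""
    return [p < threshold for p in token_probs]


def build_decoder_input(tokens, is_error, copies=3):
    """Duplicate only the flagged tokens; tokens judged correct pass through once."""
    expanded = []
    for token, flagged in zip(tokens, is_error):
        expanded.extend([token] * copies if flagged else [token])
    return expanded


tokens = ["i", "red", "a", "book"]
probs = [0.9, 0.2, 0.95, 0.9]  # hypothetical language-model probabilities
decoder_input = build_decoder_input(tokens, detect_errors(probs))
```

Duplicating a flagged token gives a CTC-style decoder several output slots at that position, so it can emit a replacement of a different length, while unflagged tokens keep a single slot.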
We shed light on a pitfall and an opportunity in physics-informed neural networks (PINNs). We prove that a multilayer perceptron (MLP) with only ReLU (Rectified Linear Unit) or ReLU-like Lipschitz activation functions always has a vanishing Hessian. Such a network-imposed constraint contradicts any second- or higher-order partial differential equation (PDE). Therefore, a ReLU-based MLP cannot form a permissible function space for approximating the solutions of such PDEs. Inspired by this pitfall, we prove that a linear PDE of up to the $n$-th order can be strictly satisfied by an MLP with $C^n$ activation functions when the weights of its output layer lie on a certain hyperplane, which we call the out-layer-hyperplane. An MLP equipped with the out-layer-hyperplane becomes "physics-enforced": it no longer requires a loss function for the PDE itself, only those for the initial and boundary conditions. Such a hyperplane exists not only for MLPs but for any network architecture that ends with a fully-connected hidden layer. To our knowledge, this should be the first PINN architecture that enforces point-wise correctness of a PDE. We give the closed-form expression of the out-layer-hyperplane for second-order linear PDEs and provide an implementation.
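The vanishing-Hessian pitfall described above can be checked numerically with a small sketch. This is an illustration, not the paper's proof or code: the network weights and the evaluation point are arbitrary values chosen for the example, and the Hessian is estimated with central finite differences at a point away from any ReLU kink.

```python
import math


def relu(z):
    return z if z > 0.0 else 0.0


def make_mlp(act):
    """Build a tiny two-input MLP with one hidden layer and fixed example weights."""
    def f(x, y):
        h1 = act(1.0 * x + 2.0 * y - 0.5)
        h2 = act(-1.5 * x + 0.5 * y + 1.0)
        h3 = act(0.5 * x - 1.0 * y + 0.25)
        return 2.0 * h1 - 1.0 * h2 + 0.5 * h3
    return f


def hessian_fd(f, x, y, h=1e-3):
    """Estimate (f_xx, f_xy, f_yy) with central finite differences."""
    d_xx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / h**2
    d_yy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / h**2
    d_xy = (
        f(x + h, y + h) - f(x + h, y - h) - f(x - h, y + h) + f(x - h, y - h)
    ) / (4.0 * h**2)
    return d_xx, d_xy, d_yy


relu_hessian = hessian_fd(make_mlp(relu), 0.3, 0.7)
tanh_hessian = hessian_fd(make_mlp(math.tanh), 0.3, 0.7)
```

The ReLU network is piecewise affine, so every entry of `relu_hessian` is zero up to rounding error, whereas the same architecture with `tanh` has a clearly nonzero `tanh_hessian`. A PDE residual containing second derivatives therefore receives no signal from the ReLU network.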
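Activation functions are element-wise mathematical functions that play a crucial role in deep neural networks (DNNs). Many novel and sophisticated activation functions have been proposed to improve DNN accuracy, but they also consume a large amount of memory during training with backpropagation. In this study, we propose nested forward automatic differentiation (forward AD), designed specifically for element-wise activation functions, for memory-efficient DNN training. We deploy nested forward AD in two widely used deep learning frameworks, TensorFlow and PyTorch, which support static and dynamic computation graphs, respectively. Our evaluation shows that nested forward AD reduces the memory footprint by up to 1.97x compared with the baseline model, and delivers a 20% improvement at the same memory reduction ratio.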
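The idea of applying forward-mode AD to an element-wise activation can be sketched with dual numbers. This is a minimal illustration, not the paper's TensorFlow or PyTorch implementation: the `Dual` class, the choice of Swish as the activation, and the function names are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Dual:
    """A value together with its derivative with respect to the activation input."""
    val: float
    dot: float

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val, self.dot * other.val + self.val * other.dot)


def sigmoid(d):
    s = 1.0 / (1.0 + math.exp(-d.val))
    return Dual(s, s * (1.0 - s) * d.dot)


def swish(d):
    """Swish activation x * sigmoid(x), composed from differentiable pieces."""
    return d * sigmoid(d)


def activation_forward(xs):
    """Compute outputs and element-wise derivatives in a single forward pass."""
    results = [swish(Dual(x, 1.0)) for x in xs]
    return [r.val for r in results], [r.dot for r in results]


def activation_backward(upstream_grads, derivs):
    """Backward pass needs only the stored derivative per element."""
    return [g * d for g, d in zip(upstream_grads, derivs)]
```

In this sketch the intermediates of the activation (here the sigmoid value and the product) are discarded after the forward pass, and only one derivative per element is kept for the backward pass. That is the source of the memory saving in the example.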
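Distributed privacy-preserving regression schemes have been developed and extended in various fields, where multiple parties collaboratively and privately run optimization algorithms, such as gradient descent, to learn a set of optimal parameters. However, traditional gradient-based methods cannot solve problems whose objective functions contain L1 regularization, such as Lasso regression. In this paper, we present Federated Coordinate Descent (FCD), a new distributed scheme designed to solve this problem securely in multiparty scenarios. Specifically, through secure aggregation and added perturbations, our scheme guarantees that (1) no local information is leaked to other parties, and (2) the global model parameters are not exposed to the cloud server. In the end, each party can remove the added perturbations to derive a global model with high performance. We show that the FCD scheme fills the gap in multiparty secure coordinate descent methods and applies to general linear regression, including linear, ridge, and Lasso regression. Theoretical security analysis and experimental results show that FCD can be performed effectively and efficiently, and that it achieves an MAE as low as that of centralized methods on three types of linear regression tasks on real-world UCI datasets.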
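For context on why coordinate descent suits L1-regularized objectives, the sketch below shows plain, centralized coordinate descent for Lasso with the standard soft-thresholding update. It is not the FCD protocol: secure aggregation and perturbations are omitted entirely. The objective is assumed to be (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1, and the function names are illustrative.

```python
def soft_threshold(rho, lam):
    """Closed-form minimizer of the one-dimensional L1-regularized subproblem."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 one coordinate at a time."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            w[j] = soft_threshold(rho, lam) / z if z > 0.0 else 0.0
    return w


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
weights = lasso_coordinate_descent(X, y, lam=0.1)
```

Each coordinate update has a closed form even though the L1 term is not differentiable at zero, and the weakly correlated second feature is set exactly to zero. Gradient descent does not produce such exact zeros on its own.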
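Object detection in remote sensing images is an important technology for Earth observation and is used in tasks such as forest fire monitoring and ocean monitoring. Despite great progress, image object detection still struggles with remote sensing images and small-scale objects, because small objects occupy only a limited number of pixels. Many existing studies have shown that an effective way to improve small object detection is to introduce spatial context. Meanwhile, recent research on image classification has shown that spectral convolution operations perceive long-range spatial dependencies more efficiently in the frequency domain than in the spatial domain. Inspired by this observation, we propose a Frequency-aware Feature Pyramid Framework (FFPF) for remote sensing object detection, which consists of a novel Frequency-aware ResNet (F-ResNet) and a Bilateral Spectral-aware Feature Pyramid Network (BS-FPN). Specifically, F-ResNet perceives spectral context information by inserting frequency-domain convolution into each stage of the backbone, thereby extracting richer features of small objects. To the best of our knowledge, this is the first work to introduce frequency-domain convolution into the remote sensing object detection task. In addition, BS-FPN uses a bilateral sampling strategy and skip connections to better model the association of object features at different scales, unleashing the potential of the spectral context information from F-ResNet. Extensive experiments are conducted on object detection in optical remote sensing image datasets (DIOR and DOTA). The experimental results demonstrate the excellent performance of our method, whose mean average precision (mAP) is obtained without any tricks.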
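Human video motion transfer (HVMT) aims to generate a video that imitates the motion of a driving person, given an image of a source person. Existing HVMT methods mainly use generative adversarial networks (GANs) to perform warping operations based on the flow estimated from the source person image and each driving video frame. However, these methods consistently produce obvious artifacts because of large differences, including in scale, between the source person and the driving person. To overcome these challenges, this paper proposes REMOT, a novel GAN-based human motion transfer framework. To generate realistic motion, REMOT adopts a progressive generation paradigm: it first generates each body part without flow-based warping, and then composes all parts into a complete person in the driving motion. Moreover, to preserve a natural global appearance, we design a global alignment module that aligns the scale and position of the source person with those of the driving person based on their layouts. Furthermore, we propose a texture alignment module that aligns each part of the person according to texture similarity. Finally, extensive quantitative and qualitative experiments show that REMOT achieves state-of-the-art results on two public benchmarks.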
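Data-driven design and innovation is a process of reusing and providing valuable and useful information. However, existing semantic networks for design innovation are built on data sources restricted to technological and scientific information. Moreover, existing studies build the edges of a semantic network on either statistical or semantic relationships alone, which is unlikely to take full advantage of both types of relationship or to uncover implicit knowledge for design innovation. We therefore construct WikiLink, a semantic network based on Wikipedia. WikiLink introduces a combined weight that fuses the statistical weight and the semantic weight between concepts, and four algorithms are developed to inspire new ideas. Evaluation experiments show that the network is characterized by high coverage of terms, relationships, and disciplines, which demonstrates its effectiveness and usefulness. A demonstration and case study results then show that WikiLink can serve as an idea generation tool for innovation in conceptual design. The source code and backend data of WikiLink are provided as open source for more users to explore and build on.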